Getting Started with R

  • Install R and (recommended) RStudio.
  • For writing reports, R Markdown is a convenient format that integrates code, R output, plots, and text in a single document.
  • If you are completely new to programming and R, the R Cookbook is an excellent resource that starts at the very beginning and covers the basics of programming and R functions.

Working with Data

Bodyfat dataset

This is a dataset studying various body measurements as they relate to the percentage of body fat determined by underwater weighing for 252 men.

Body fat is an important health measurement, but measuring it accurately is inconvenient and costly, so easier methods of estimating it are desirable. Body fat can be estimated from tables using age and various skin-fold measurements obtained with a caliper. Other estimates come from predictive equations that use body circumference measurements (e.g., abdominal circumference), skin-fold measurements, or both.

Loading the dataset

Download the dataset from here or from bcourses. Make sure it is in your current working directory. If not, move the file or change your working directory with setwd().

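# Point R at the folder that contains the file (adjust the path for your machine)
setwd("~/Dropbox/Berkeley/151A/")
# Read the CSV file into a data frame
bodyfat <- read.csv("Bodyfat.csv")
# Print the first six rows
head(bodyfat)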
##   Density bodyfat Age Weight Height Neck Chest Abdomen   Hip Thigh Knee
## 1  1.0708    12.3  23 154.25  67.75 36.2  93.1    85.2  94.5  59.0 37.3
## 2  1.0853     6.1  22 173.25  72.25 38.5  93.6    83.0  98.7  58.7 37.3
## 3  1.0414    25.3  22 154.00  66.25 34.0  95.8    87.9  99.2  59.6 38.9
## 4  1.0751    10.4  26 184.75  72.25 37.4 101.8    86.4 101.2  60.1 37.3
## 5  1.0340    28.7  24 184.25  71.25 34.4  97.3   100.0 101.9  63.2 42.2
## 6  1.0502    20.9  24 210.25  74.75 39.0 104.5    94.4 107.8  66.0 42.0
##   Ankle Biceps Forearm Wrist
## 1  21.9   32.0    27.4  17.1
## 2  23.4   30.5    28.9  18.2
## 3  24.0   28.8    25.2  16.6
## 4  22.8   32.4    29.4  18.2
## 5  24.0   32.2    27.7  17.7
## 6  25.6   35.7    30.6  18.8
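
If read.csv() fails with a "cannot open file" error, the file is usually not where R is looking. A minimal sketch of the checks, using a small temporary file as a stand-in for Bodyfat.csv (the file name Bodyfat_demo.csv and the variable names csv_path and demo are illustrative; the two data rows are copied from the output above):

```r
# Write a tiny stand-in CSV to a temporary folder (illustrative only)
csv_path <- file.path(tempdir(), "Bodyfat_demo.csv")
writeLines(c("Density,bodyfat,Age,Weight,Height",
             "1.0708,12.3,23,154.25,67.75",
             "1.0853,6.1,22,173.25,72.25"),
           csv_path)

getwd()                 # the folder R is currently looking in
file.exists(csv_path)   # TRUE if R can find the file at this path

demo <- read.csv(csv_path)
dim(demo)               # number of rows, then number of columns
str(demo)               # column names and types
```

For the real file, the same checks would be file.exists("Bodyfat.csv") before reading and dim(bodyfat) afterwards.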

To understand what information the data is giving us, we need to understand what the variable names mean. From the data description, we find that the variables are:

  • Density determined from underwater weighing
  • Percent body fat from Siri’s (1956) equation
  • Age (years)
  • Weight (lbs)
  • Height (inches)
  • Neck circumference (cm)
  • Chest circumference (cm)
  • Abdomen 2 circumference (cm)
  • Hip circumference (cm)
  • Thigh circumference (cm)
  • Knee circumference (cm)
  • Ankle circumference (cm)
  • Biceps (extended) circumference (cm)
  • Forearm circumference (cm)
  • Wrist circumference (cm)
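
The bodyfat column is derived from Density rather than measured separately. As a sanity check, the sketch below assumes the commonly quoted form of Siri's equation, percent body fat = 495 / Density - 450 (this formula is not given in the data description above), and applies it to the six rows printed earlier. The names density, reported, and siri are illustrative.

```r
# Density and bodyfat for rows 1-6, copied from the head() output
density  <- c(1.0708, 1.0853, 1.0414, 1.0751, 1.0340, 1.0502)
reported <- c(12.3, 6.1, 25.3, 10.4, 28.7, 20.9)

# Assumed form of Siri's equation (not stated in the data description)
siri <- 495 / density - 450

# Side-by-side comparison, rounded to one decimal place like the data
data.frame(density, reported,
           siri = round(siri, 1),
           diff = round(reported - siri, 1))
```

Rows where the difference is not close to zero are worth a second look, since they suggest a recording or rounding issue in one of the two columns.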

Visualizing Data

Variable Exploration

It is always a good idea to take a look at your data to make sure everything looks reasonable (variable names make sense, measurement units are correct, there are no absurd outliers).

summary(bodyfat)
##     Density         bodyfat           Age            Weight     
##  Min.   :0.995   Min.   : 0.00   Min.   :22.00   Min.   :118.5  
##  1st Qu.:1.041   1st Qu.:12.47   1st Qu.:35.75   1st Qu.:159.0  
##  Median :1.055   Median :19.20   Median :43.00   Median :176.5  
##  Mean   :1.056   Mean   :19.15   Mean   :44.88   Mean   :178.9  
##  3rd Qu.:1.070   3rd Qu.:25.30   3rd Qu.:54.00   3rd Qu.:197.0  
##  Max.   :1.109   Max.   :47.50   Max.   :81.00   Max.   :363.1  
##      Height           Neck           Chest           Abdomen      
##  Min.   :29.50   Min.   :31.10   Min.   : 79.30   Min.   : 69.40  
##  1st Qu.:68.25   1st Qu.:36.40   1st Qu.: 94.35   1st Qu.: 84.58  
##  Median :70.00   Median :38.00   Median : 99.65   Median : 90.95  
##  Mean   :70.15   Mean   :37.99   Mean   :100.82   Mean   : 92.56  
##  3rd Qu.:72.25   3rd Qu.:39.42   3rd Qu.:105.38   3rd Qu.: 99.33  
##  Max.   :77.75   Max.   :51.20   Max.   :136.20   Max.   :148.10  
##       Hip            Thigh            Knee           Ankle     
##  Min.   : 85.0   Min.   :47.20   Min.   :33.00   Min.   :19.1  
##  1st Qu.: 95.5   1st Qu.:56.00   1st Qu.:36.98   1st Qu.:22.0  
##  Median : 99.3   Median :59.00   Median :38.50   Median :22.8  
##  Mean   : 99.9   Mean   :59.41   Mean   :38.59   Mean   :23.1  
##  3rd Qu.:103.5   3rd Qu.:62.35   3rd Qu.:39.92   3rd Qu.:24.0  
##  Max.   :147.7   Max.   :87.30   Max.   :49.10   Max.   :33.9  
##      Biceps         Forearm          Wrist      
##  Min.   :24.80   Min.   :21.00   Min.   :15.80  
##  1st Qu.:30.20   1st Qu.:27.30   1st Qu.:17.60  
##  Median :32.05   Median :28.70   Median :18.30  
##  Mean   :32.27   Mean   :28.66   Mean   :18.23  
##  3rd Qu.:34.33   3rd Qu.:30.00   3rd Qu.:18.80  
##  Max.   :45.00   Max.   :34.90   Max.   :21.40
hist(bodyfat$Biceps)

boxplot(bodyfat$Biceps)

Boxplots are useful for identifying potential outliers. Here one biceps measurement lies far above the rest, so we look at that observation more closely.

outlier <- which.max(bodyfat$Biceps)
bodyfat[outlier, ]
##    Density bodyfat Age Weight Height Neck Chest Abdomen   Hip Thigh Knee
## 39  1.0202    35.2  46 363.15  72.25 51.2 136.2   148.1 147.7  87.3 49.1
##    Ankle Biceps Forearm Wrist
## 39  29.6     45      29  21.4
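
The rule a boxplot uses to flag points can also be applied numerically with boxplot.stats(). The sketch below runs it on the Biceps values of the seven rows printed so far (rows 1-6 and row 39), so the cutoffs differ from those of the full dataset; biceps and stats are illustrative names.

```r
# Biceps values for rows 1-6 and row 39, copied from the output above
biceps <- c(32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 45.0)
names(biceps) <- c(1:6, 39)

stats <- boxplot.stats(biceps)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # values more than 1.5 times the hinge spread beyond the box
```

On the full data, boxplot.stats(bodyfat$Biceps)$out would list the flagged values in the same way.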

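It is sometimes a good idea to remove outliers from data, especially if an observation is an outlier across many different variables. That is the case here: comparing row 39 with the summary above shows that it also has the maximum Weight, Neck, Chest, Abdomen, Hip, Thigh, Knee, and Wrist values. To remove outlier rows, identify their indices and use negative indexing. Note that this returns a new data frame without those rows and leaves bodyfat itself unchanged, so assign the result to a variable if you want to keep it.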

bodyfat[-outlier,]
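
Two steps are easy to miss here: checking whether the row is extreme in several columns, and saving the result of the removal. A sketch using a few columns of the seven rows printed so far as a stand-in for the full data (bf and bf_clean are illustrative names):

```r
# Rows 1-6 and row 39, selected columns, copied from the output above
bf <- data.frame(
  Weight  = c(154.25, 173.25, 154.00, 184.75, 184.25, 210.25, 363.15),
  Abdomen = c(85.2, 83.0, 87.9, 86.4, 100.0, 94.4, 148.1),
  Biceps  = c(32.0, 30.5, 28.8, 32.4, 32.2, 35.7, 45.0),
  row.names = as.character(c(1:6, 39))
)

outlier <- which.max(bf$Biceps)

# In how many columns does this row hold the maximum?
sum(sapply(bf, which.max) == outlier)

# Negative indexing returns a copy without the row; assign it to keep it
bf_clean <- bf[-outlier, ]
nrow(bf)         # original is unchanged
nrow(bf_clean)   # one row fewer
```

Keeping the cleaned data under a separate name makes it easy to compare results with and without the outlier.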

Relationships between variables

Scatterplots are useful to explore the relationship between two variables.

plot(bodyfat$Biceps, bodyfat$Forearm)

We can look at all possible relationships using the pairs function.

pairs(bodyfat)

From these scatterplots, we can see many strong correlations. Weight looks correlated with many other variables, and measurements taken at nearby parts of the body appear correlated with one another.
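
The visual impression from pairs() can be backed up with numbers using cor(), which returns the matrix of pairwise correlations. A sketch on rows 1-6 from the head() output with only a few columns (six rows are far too few to draw conclusions; bf and corr are illustrative names):

```r
# Rows 1-6, selected columns, copied from the head() output
bf <- data.frame(
  Weight  = c(154.25, 173.25, 154.00, 184.75, 184.25, 210.25),
  Chest   = c(93.1, 93.6, 95.8, 101.8, 97.3, 104.5),
  Hip     = c(94.5, 98.7, 99.2, 101.2, 101.9, 107.8),
  Forearm = c(27.4, 28.9, 25.2, 29.4, 27.7, 30.6)
)

# Pairwise Pearson correlations between all columns
corr <- cor(bf)
round(corr, 2)
```

On the full data, round(cor(bodyfat), 2) gives the same kind of table for all 15 variables.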

Coplots

How can we make sense of these many correlations? A coplot is a helpful way to visualize the relationship between two variables, conditioned on a third variable.

We might be interested in predicting bodyfat from neck and abdominal circumference.

pairs(~bodyfat + Neck + Abdomen, data = bodyfat)

Both neck and abdominal circumference seem to be good predictors of bodyfat. Are both variables necessary for prediction, or are they telling us the same information?

coplot(bodyfat ~ Neck | Abdomen, data = bodyfat, rows = 1)

Here none of the panels shows a clear trend. Each panel restricts the data to a range of abdominal circumference, so the lack of trend indicates that once the abdominal circumference is known, the neck circumference gives little additional information about bodyfat and does not noticeably improve our prediction.
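
The same idea can be checked numerically by comparing a model that uses only one predictor with a model that uses both. The sketch below uses simulated data, not the bodyfat dataset: by construction, fat depends only on abdomen, while neck is merely correlated with abdomen. All names and numbers (sim, fat, neck, abdomen, the coefficients) are made up for illustration.

```r
set.seed(1)
n <- 200

# Simulated measurements: neck tracks abdomen, fat depends on abdomen only
abdomen <- rnorm(n, mean = 92, sd = 10)
neck    <- 38 + 0.2 * (abdomen - 92) + rnorm(n, sd = 1.5)
fat     <- 19 + 0.6 * (abdomen - 92) + rnorm(n, sd = 4)
sim     <- data.frame(fat, neck, abdomen)

# Marginally, neck looks like a useful predictor of fat
cor(sim$fat, sim$neck)

# Compare the fit with and without neck
fit_abd  <- lm(fat ~ abdomen, data = sim)
fit_both <- lm(fat ~ abdomen + neck, data = sim)
summary(fit_abd)$r.squared
summary(fit_both)$r.squared

# F-test for whether adding neck improves the fit
anova(fit_abd, fit_both)
```

On the real data, the analogous comparison would be lm(bodyfat ~ Abdomen, data = bodyfat) against lm(bodyfat ~ Abdomen + Neck, data = bodyfat).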